feat: Migrating Update from the original firecrawl repository#41
Closed
dan-and wants to merge 48 commits into devflowinc:main from
Conversation
- Add bulk scrape controller for processing multiple URLs
- Add bulkScrapeRequestSchema and BulkScrapeRequest types
- Add /v1/bulk/scrape POST and GET endpoints
- Make originUrl optional in StoredCrawl type
- Update queue worker to handle bulk scrape jobs
- Add null safety for crawlerOptions in runWebScraper

Based on commit 03b3799 from firecrawl-original
- Update axios from ^1.3.4 to ^1.12.2
- Fixes CVE-2024-24691 and other security vulnerabilities
- No breaking changes, all functionality preserved

Based on commit 50343bc from firecrawl-original
- Add content-type filtering to robots.txt parsing to prevent HTML error pages from being treated as robots.txt rules (ba3e4cd)
- Fix URL validation regex to allow query parameters and fragments (cfd776a)

Resolves issues where sites like JPMorgan Chase had robots.txt rules ignored and URLs with params were failing validation.
- Add scrapeId field to DocumentMetadata interface
- Update legacyDocumentConverter to include scrapeId in API response
- Modify scrape controller to pass crawl_id to job data for scrape jobs
- Update single_url scraper to include scrapeId in document metadata
- Fix queue worker to properly handle scrape vs crawl job distinction
- Fix Docker build issues and align pnpm lockfile for reproducible builds

Based on commit d1f3b96 from the original firecrawl repository
- Add ulimits configuration to docker-compose.yaml
- Set nofile limits to 65535 (soft and hard)
- Improves performance for high-concurrency web scraping
- Prevents 'too many open files' errors
- Synchronized with docker-compose.dev.yaml

Based on commit f0a1a2e from the original firecrawl repository
- Slice sitemap URLs to respect the specified limit
- Add debug logging for sitemap URL limiting
- Prevent creating excessive jobs from large sitemaps

Based on: firecrawl-original commit c6ebbc6
- Add visited_unique Redis set for proper limit counting
- Update lockURL and lockURLs functions to track unique URLs
- Match original firecrawl architecture for limit handling
- Ensure limit parameter is properly enforced during crawling

Based on: firecrawl-original commit c6ebbc6
- Update V0 and V1 status controllers
- Fix status logic and type definitions
- Improve status handling reliability

Based on: original firecrawl repository commit 6637dce
- Add Winston dependency for structured logging
- Update TypeScript config to support ES2022 features
- Replace basic logging with JSON format and metadata
- Add zero data retention and error serialization
- Update controllers and workers with contextual logging

Based on original firecrawl repository commit 4a6b46d
- Replace console.log with Logger calls across all files
- Remove commented-out console.log statements
- Enhance error logging with structured metadata
- Improve logging consistency and professionalism

Based on the original firecrawl commit 4c49bb9
- Increase file descriptor limit from 65535 to 1048576
- Apply ulimit to API, Worker, and Puppeteer services
- Improve support for high-concurrency web scraping scenarios
- Prevent "Too Many Open Files" errors under load

Based on the original firecrawl repository commit f0a1a2e
Add advanced Redis memory management features:

URL Deduplication:
- Implement generateURLPermutations() for detecting similar URLs
- Support www/http/https variations and common path permutations
- Enable deduplicateSimilarURLs crawler option
- Achieve (very optimistic) ~16x memory reduction for crawled URL tracking

Connection Management:
- Add getRedisConnection() with lazy loading pattern
- Replace direct Redis instantiation across all services
- Include connection monitoring and error handling
- Optimize resource usage in concurrent environments

Updated components:
- Queue service, crawl Redis logic, job priority system
- All crawl controllers (v0/v1) with enhanced URL locking
- Worker processes with optimized Redis usage

Based on original firecrawl commits 308e4f4 (connection opt) + 7611f81 (URL dedup)
Impact: Significant memory savings + improved reliability
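The deduplication idea above can be sketched as follows. This is a hypothetical reconstruction of what a `generateURLPermutations()` helper might look like, covering only the www/protocol variants mentioned in the commit; the real implementation may normalize more permutations.

```typescript
// Hypothetical sketch: generate protocol/www permutations of a URL so a
// single Redis set member can represent all of them during crawling.
function generateURLPermutations(url: string): string[] {
  const u = new URL(url);
  const bareHost = u.hostname.replace(/^www\./, "");
  const perms: string[] = [];
  for (const protocol of ["http:", "https:"]) {
    for (const host of [bareHost, "www." + bareHost]) {
      const p = new URL(u.href);
      p.protocol = protocol;
      p.hostname = host;
      perms.push(p.href);
    }
  }
  return perms;
}
```

With all permutations mapped to one canonical entry, the crawler can skip `http://www.example.com/a` once `https://example.com/a` has been visited.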
Hi! I'm also awaiting some updates on firecrawl-simple... Does your pull request add either the Delete or Map endpoints, or the crawl options ignoreQueryParameters or maxDiscoveryDepth?
…rapers Add isPDFContent() to avoid treating PDF responses as HTML. When PDF is detected (signature, structure, or binary content), return empty content and a clear pageError instead of passing raw PDF to extraction. Co-authored-by: Cursor <cursoragent@cursor.com>
…ginal

Tier 1 tasks (low-risk, concrete changes):
- T1-B: Fix security vulnerabilities (glob, systeminformation, validator, minimatch, axios, undici, js-yaml, jsdiff, qs)
- T1-A: Upgrade Node base image from 20 to 22 in Dockerfile, add python3 for ARM build
- T1-C: Fix protocol in crawl/batch-scrape next URL for self-hosted deployments
- T1-H: Require at least 1 URL in batch scrape schema
- T1-I: Return 400 when all URLs are filtered out in bulk-scrape
- T1-E: Filter non-HTTP(S) protocols in crawler link extraction
- T1-D: Return 409 when cancelling an already-completed crawl
- T1-J: Add SIGTERM handler and guard signal handlers with require.main === module
- T1-F: Preserve SPA hash routes (#/ and #!/) in crawler link filtering
- T1-G: Optimize Redis operations in crawl-redis.ts using pipelines
- T1-M: Fix sitemap 404 handling in getLinksFromSitemap
- T1-L: Fix excessive robots.txt fetching during crawls (add setBaseUrl method)

All changes maintain backward compatibility and pass existing unit tests.
…RL resolution

- Default replaceAllPathsWithAbsolutePaths to true so markdown links are absolute by default.
- In replacePaths.ts, resolve root-relative paths (/) against origin and other relative paths (./, ../, bare) against the full source URL so relative links resolve correctly.
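The resolution rule described above can be illustrated with the WHATWG `URL` constructor, which Node.js provides natively. The function name here is an assumption for illustration; the actual code lives in replacePaths.ts.

```typescript
// Sketch of the relative-link resolution rule: root-relative paths go
// against the origin, everything else against the full source URL.
function resolveMarkdownLink(href: string, sourceUrl: string): string {
  const source = new URL(sourceUrl);
  if (href.startsWith("/")) {
    // "/about" on https://example.com/docs/page -> https://example.com/about
    return new URL(href, source.origin).href;
  }
  // "img.png" or "../x" resolve relative to the page that linked them.
  return new URL(href, source.href).href;
}
```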
- Updated all uuid imports from v4 to v7
- Changed uuidv4() to uuidv7() in 11 files:
  - src/index.ts
  - src/controllers/v1/crawl.ts
  - src/controllers/v1/bulk-scrape.ts
  - src/controllers/v1/scrape.ts
  - src/controllers/v1/map.ts
  - src/controllers/v0/crawl.ts
  - src/controllers/v0/scrape.ts
  - src/controllers/v0/crawlPreview.ts
  - src/services/queue-worker.ts
  - src/__tests__/e2e_full_withAuth/index.test.ts
  - src/controllers/__tests__/crawl.test.ts
- UUID package already at v10.0.0 (supports v7)
- Benefits: time-ordered IDs for better debugging and Redis key ordering
- No breaking changes (both v4 and v7 are valid UUIDs)
- Set includeMarkdown: false when calling scrapeSingleUrl from WebCrawler
- Robots.txt and sitemap already use axios.get directly (skip scrape pipeline)
- Crawler only needs rawHtml to extract links; markdown conversion is wasteful
- Reduces CPU usage during crawling operations
- No change to API output or user-facing behavior
- Add fetchTimeout constant (15s) for simple HTTP requests
- Browser/Hero still uses 60s timeout for tab launch and JS execution
- Servers that don't respond in 15s typically won't give useful content
- Reduces wait time for timeout failures on plain fetch attempts
- Set attempts: 1 in default job options (prevent automatic retries)
- Reduce maxStalledCount from 10 to 2 (detect hung jobs faster)
- Reduces duplicate work when jobs hang and time out
T1-N: Fix scraper fallback order (fetch first, Hero second)
- Changed default scraper order from ["playwright", "fetch"] to ["fetch", "playwright"]
- Static HTML pages now use fetch directly (3-5x faster, 95% memory reduction)
- Hero only invoked as fallback when fetch fails (JS-rendered pages)
- Verified with tests: example.com (523ms), httpbin.org (1047ms), heise.de (1058ms)

T1-O: Replace axios with undici for 3-4x faster HTTP requests
- Replaced axios.get() with undici.request() in scrapers/fetch.ts
- Added undici as explicit dependency (already bundled in Node.js 22)
- Added User-Agent header for better success rates
- Properly consume body stream to prevent connection leaks
- Handle undici-specific timeout codes (UND_ERR_HEADERS_TIMEOUT, UND_ERR_BODY_TIMEOUT)
- Created comprehensive unit tests (9/9 passing)

Bonus fix: queue-worker.ts null check
- Added null check before accessing sc.cancelled property
- Fixes "Cannot read properties of null (reading 'cancelled')" error

Performance impact:
- Static pages: ~3-5x faster (34-293ms fetch vs 400-800ms Hero)
- HTTP requests: 3-4x faster (undici 18k-22k req/sec vs axios 5.7k req/sec)
- Memory: ~95% reduction for static pages (no Hero browser needed)
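The fetch-first fallback order amounts to trying each engine in sequence until one returns content. This is a minimal illustrative sketch; the real engine interfaces and error handling in single_url.ts are more involved, and the `Scraper` signature here is an assumption.

```typescript
// Illustrative sketch: try engines in order (e.g. fetch, then playwright/
// Hero) and stop at the first one that returns non-empty HTML.
type Scraper = (url: string) => Promise<string | null>;

async function scrapeWithFallback(url: string, order: Scraper[]): Promise<string | null> {
  for (const scraper of order) {
    const html = await scraper(url);
    if (html) return html; // first engine to return content wins
  }
  return null; // every engine failed
}
```

Putting the cheap static fetch first means the expensive browser engine only runs for pages the plain fetch could not handle.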
T1-P: Add CycleTLS as Tier 2 engine between fetch and Hero
- Created scrapers/tls-client.ts using cycletls Go subprocess
- Updated scraper fallback order: fetch -> tls-client -> playwright
- Added TLS_CLIENT_ENABLED env var (default: true in docker-compose)
- Added shutdown handler in queue-worker for clean subprocess exit
- Created comprehensive test suite (7 tests passing)
- Updated README.md with scraper technologies table

T1-Q: Switch html-to-markdown to firecrawl fork
- Updated go.mod: JohannesKaufmann v1.6.0 -> firecrawl latest
- Updated import paths in html-to-markdown.go to firecrawl fork
- Benefits: Fixes nested div in code blocks, performance improvements, new plugins (YouTube/Vimeo iframe conversion, robust code blocks)

Performance impact:
- CycleTLS bypasses 60-80% bot protection without browser overhead
- firecrawl fork improves large HTML conversion and code block handling
Rewrites viewer URLs to their export formats for direct HTML fetching:
- Google Docs /edit → /export?format=html
- Google Slides /edit → /export?format=html
- Google Sheets /edit → /gviz/tq?tqx=out:html (preserves gid tab param)
- Google Drive /file/d → /uc?export=download

Skips already-published /d/e/ URLs, which are publicly accessible. All 12 test cases pass, covering edit links, query params, hash fragments, and edge cases.

Task: T1-U
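A hedged sketch of the rewrite rules listed above. The function name and exact regex handling are assumptions for illustration; only the URL mappings themselves come from the commit.

```typescript
// Sketch: map Google viewer URLs to their export equivalents so a plain
// HTTP fetch returns HTML instead of the interactive editor shell.
function rewriteGoogleViewerUrl(url: string): string {
  const u = new URL(url);
  if (u.hostname !== "docs.google.com" && u.hostname !== "drive.google.com") return url;
  if (u.pathname.includes("/d/e/")) return url; // already published, fetchable as-is

  const doc = u.pathname.match(/^\/(document|presentation)\/d\/([^/]+)/);
  if (doc) return `https://docs.google.com/${doc[1]}/d/${doc[2]}/export?format=html`;

  const sheet = u.pathname.match(/^\/spreadsheets\/d\/([^/]+)/);
  if (sheet) {
    const gid = u.hash.match(/gid=(\d+)/)?.[1]; // preserve the selected tab
    return `https://docs.google.com/spreadsheets/d/${sheet[1]}/gviz/tq?tqx=out:html${gid ? `&gid=${gid}` : ""}`;
  }

  const file = u.pathname.match(/^\/file\/d\/([^/]+)/);
  if (file) return `https://drive.google.com/uc?export=download&id=${file[1]}`;

  return url;
}
```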
Added NODE_EXTRA_CA_CERTS=/etc/ssl/certs/ca-certificates.crt to docker-compose.yaml environment to fix 'unable to get local issuer certificate' errors when making HTTPS requests from containers. This fixes undici and axios TLS verification failures without disabling certificate validation entirely. Related: T1-U testing revealed HTTPS certificate issues in Docker environment
T1-R: Added HTTP 415 (Unsupported Media Type) to fallback-loop break conditions alongside 404 and 500. Prevents unnecessary retries when the server rejects the content type. Updated log message to show the actual status code.

T1-T: Added SITEMAP_URL_LIMIT constant (10_000) to cap URLs from sitemap processing. Uses Math.min(this.limit, SITEMAP_URL_LIMIT) to prevent memory spikes from huge sitemaps while respecting user-defined crawl limits.

Tests: 3 new tests for terminal status codes (404/415/500)
Tasks: T1-R, T1-T
… HTML Adds a guard to scrapeController that rejects requests with an 'actions' field, returning 400 with a helpful message explaining that browser actions are not yet supported. Also adds a MAX_HTML_FOR_MARKDOWN constant (300KB) to skip markdown conversion for very large HTML responses, improving performance and avoiding memory issues. Unit tests verify both behaviors work correctly. Task: T1-X, T1-S
Permit per-domain engine selection via FORCED_ENGINE_DOMAINS env var. Supports exact domain matching and wildcard subdomains (*.example.com). Forces specific scrapers (fetch/playwright/tls-client) for matched URLs without code changes.

Task: T2-A

Files changed:
- apps/api/src/scraper/WebScraper/utils/engine-forcing.ts (new)
- apps/api/src/scraper/WebScraper/utils/__tests__/engine-forcing.test.ts (new)
- apps/api/src/scraper/WebScraper/single_url.ts (forced engine logic)
- apps/api/src/services/queue-worker.ts (init call)
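The matching logic might look like the sketch below. The env-var format (`domain=engine` pairs, comma-separated) is an assumption; the commit only states that exact domains and `*.example.com` wildcards are supported.

```typescript
// Sketch: parse FORCED_ENGINE_DOMAINS and resolve a forced engine for a
// URL, with exact-host matches taking priority over wildcard rules.
type Engine = "fetch" | "playwright" | "tls-client";

function parseForcedEngines(envVal: string): Map<string, Engine> {
  const rules = new Map<string, Engine>();
  for (const pair of envVal.split(",")) {
    const [pattern, engine] = pair.split("=").map((s) => s.trim());
    if (pattern && engine) rules.set(pattern, engine as Engine);
  }
  return rules;
}

function forcedEngineFor(url: string, rules: Map<string, Engine>): Engine | undefined {
  const host = new URL(url).hostname;
  if (rules.has(host)) return rules.get(host); // exact domain match
  for (const [pattern, engine] of rules) {
    // "*.shop.com" matches "a.shop.com" but not "shop.com" itself
    if (pattern.startsWith("*.") && host.endsWith(pattern.slice(1))) return engine;
  }
  return undefined;
}
```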
… handling Adds proxy: 'basic' | 'stealth' | 'enhanced' parameter to control scraping engine (fetch for basic, Hero for stealth/enhanced). Forwards scrapeId to Hero service for log correlation. Maps network errors to specific status codes (0 for DNS/refused reset, 408 for timeouts) and replaces opaque 'Internal server error' messages with actionable error text. Task: T1-V, T1-W, T1-Y, T1-Z
Updates Hero browser engine to latest alpha.34 release (Sep 2025): - Chrome 139 emulation for better bot detection evasion - TLS fingerprint fix for Chrome 133+ - SOCKS5 proxy connection fix (PR firecrawl#378) - All 6 @ulixee packages upgraded to 2.0.0-alpha.34 Task: T2-F
Adds queueDurationMs (time spent in queue before worker processing) to scrape response metadata to help operators distinguish site slowness from server overload. Uses BullMQ's processedOn and timestamp properties from the job queue, which accurately reflect queue wait time. The value is a non-negative integer in milliseconds. Task: T1-AA
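The computation is a one-liner over BullMQ's job fields, sketched here for clarity (`timestamp` is when the job was enqueued, `processedOn` when a worker picked it up):

```typescript
// Sketch: queue wait time from BullMQ job fields; undefined while the
// job is still waiting, clamped to 0 against clock skew.
function queueDurationMs(job: { timestamp: number; processedOn?: number }): number | undefined {
  if (job.processedOn === undefined) return undefined; // not picked up yet
  return Math.max(0, job.processedOn - job.timestamp);
}
```

A large and growing value across jobs points at worker overload; a small value with slow responses points at the target site.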
Screenshots are time-sensitive — a cached screenshot from an hour ago may show completely different page content. When screenshot or fullPageScreenshot is in pageOptions, skip the cache read entirely so a fresh scrape is always taken. Non-screenshot requests are unaffected. Task: T1-AB
Adds isInternalHost() function to validateUrl.ts that blocks: - Private IPv4 ranges (10.0.0.0/8, 172.16-31/12, 192.168.0.0/16) - Loopback addresses (127.0.0.0/8, ::1) - Link-local addresses (169.254.0.0/16) - Docker-internal hostnames (localhost, redis, worker, playwright-service, api) Integration: checkUrl, checkAndUpdateURL, and checkAndUpdateURLForMap now call isInternalHost() and throw "URL hostname is not allowed" when blocked. Zod schema updated to allow internal URLs through to checkUrl() refinement. All 50 unit tests pass (including 17 new SSRF protection tests). Task: T1-AG
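A condensed sketch of the guard described above. The real validateUrl.ts implementation may handle IPv6 ranges and DNS rebinding concerns more thoroughly; this shows only the ranges and hostnames listed in the commit.

```typescript
// Sketch of the SSRF guard: block private, loopback, and link-local IPv4
// ranges plus Docker-internal service hostnames.
const DOCKER_HOSTS = new Set(["localhost", "redis", "worker", "playwright-service", "api"]);

function isInternalHost(hostname: string): boolean {
  const h = hostname.toLowerCase();
  if (DOCKER_HOSTS.has(h) || h === "::1" || h === "[::1]") return true;
  const m = h.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  if (a === 10 || a === 127) return true;            // 10.0.0.0/8, 127.0.0.0/8
  if (a === 172 && b >= 16 && b <= 31) return true;  // 172.16.0.0/12
  if (a === 192 && b === 168) return true;           // 192.168.0.0/16
  if (a === 169 && b === 254) return true;           // 169.254.0.0/16 link-local
  return false;
}
```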
Wraps Hero Core initialization with a configurable timeout (default 30 s) to prevent silent hangs. If initialization fails to complete within the timeout, the service exits immediately with a clear error message so the container can restart automatically. Task: T1-AC
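The usual pattern for this kind of startup guard is racing the initialization promise against a timer; a generic sketch (the helper name is an assumption):

```typescript
// Sketch: reject if a promise does not settle within ms, so a hung
// initialization surfaces as an error instead of a silent stall.
function withTimeout<T>(p: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
}
```

The service would then call something like `withTimeout(heroCore.start(), 30_000, "Hero Core init")` and exit the process on rejection, letting the container orchestrator restart it.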
Adds ignoreCache boolean parameter to /v1/map endpoint for API compatibility. Clients can opt out of any future URL caching by sending ignoreCache: true. The parameter defaults to false to preserve existing behavior. Task: T1-AE
… URLs Adds specific error message when screenshot format is requested for PDF or binary document URLs. Previously, users would see an empty screenshot field with a generic PDF error, which was confusing. Now they get a clear, actionable message recommending use of markdown format with parsePDF: true. Task: T1-AH
… URLs

Add new `sitemapOnly: boolean` parameter (default: false) to crawl options. When true, the crawler only processes URLs from the site's sitemap. If no sitemap is found, the crawl returns a 400 error instead of falling back to a full link-following crawl.

Behavior changes:
- sitemapOnly=false (default): try sitemap, fall back to full crawl (existing)
- sitemapOnly=true with sitemap: only scrape sitemap URLs
- sitemapOnly=true without sitemap: return 400 error

This gives users control over crawl behavior for structured sites where they only want explicitly published URLs, not all discovered links.

Task: T2-G
Add new `actions` field to scrape options with support for:
- click: Click elements by CSS selector
- type: Type text into input fields
- wait: Pause execution for specified milliseconds
- scroll: Scroll page up or down
- screenshot: Capture screenshot mid-flow

Action execution is only supported when the Hero browser service is configured. The simple wait action is fully implemented. Click, type, scroll, and screenshot actions are placeholders (logged but not yet executed, pending Hero API confirmation).

Task: T2-L
Adds native support for Word documents (.docx and .doc) through binary fetching and conversion. Previously these files returned empty content.
- DOCX: Uses mammoth to convert to HTML with good fidelity
- Legacy .doc: Uses word-extractor for OLE binary format parsing
- Both: Downloaded as binary arrays, parsed before text extraction
- Integration: Properly hooks into the WebScraper pipeline

Fixed critical bugs preventing document scraping:
1. processLinks() was filtering out docLinks and pdfLinks before processing
2. fetch.ts checked document types AFTER the first HTTP request

Tasks: T2-J, T2-K
Skips cache entries that are too recently cached (< minAge ms old) and forces a fresh re-scrape, giving callers control over cache freshness thresholds rather than just TTL expiration. Task: T2-N
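Following the rule as stated above (entries younger than minAge are skipped and re-scraped), the check reduces to a single comparison. Field and function names here are assumptions for illustration:

```typescript
// Sketch of the minAge freshness rule: a cache entry is only usable once
// it is at least minAgeMs old; younger entries force a fresh re-scrape.
function shouldUseCacheEntry(cachedAtMs: number, minAgeMs: number, nowMs: number): boolean {
  return nowMs - cachedAtMs >= minAgeMs;
}
```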
This change adds RunPod pod ID logging and OOM kill detection without modifying any behavior when not running in RunPod, keeping the signal handler backward compatible with traditional deployments. The logging helps operators understand termination reasons in cloud environments where SIGTERM often indicates resource exhaustion. This provides better visibility into process lifecycle without adding overhead to non-RunPod environments. Task: T2-O
Previously spreadsheets returned empty/binary content; now extract structured data from .xlsx/.xls/.ods files using SheetJS. Early-exit fetch block with binary parsing, plus content-type guard prevents Cloudflare redirects being misinterpreted. Task: T2-I
The Dockerfile had two consecutive FROM blocks with no multi-stage naming, making the first stage (lines 1-21) dead code that was silently overridden by the second FROM. Removed the dead stage, keeping the effective build unchanged. Build verified: image builds cleanly and /health returns 200.
Adds a Renovate config to firecrawl-simple/ so it is picked up by --platform=local. Pins Node.js Docker images to LTS 22 via the workarounds:nodeDockerVersioning preset and allowedVersions:^22 — blocks upgrades to non-LTS versions (23, 25) while flagging node:18 in puppeteer-service-ts as requiring an 18→22 upgrade (EOL April 2025). Also removes the hardcoded GitHub PAT from the root renovate.json.
Add 'html' and 'links' to the Format type and Zod enum; previously both were documented in the OpenAPI spec but rejected with 400 by the validator. Add includeHtml and onlyMainContent to PageOptions and wire them through legacyScrapeOptions so each format is independently gated.
- Default formats changed from ["markdown","rawHtml"] to ["markdown"] to match the upstream API contract
- html and rawHtml now have independent includeHtml / includeRawHtml flags; previously both were gated by the single includeRawHtml flag
- links key is now absent from the response when not requested (was always emitted as [] via legacyDocumentConverter, leaking an unexpected key)
- Pruning in scrape.ts and crawl-status.ts uses !== undefined so empty strings are correctly removed, not silently kept
- OpenAPI spec updated: crawl scrapeOptions enum and default corrected, html field added to crawl response schema
Three bugs combined to return <!DOCTYPE html> in the markdown field for sites with GDPR consent walls (heise.de and similar German publishers).

1. MAX_HTML_FOR_MARKDOWN bypass (single_url.ts): pages over 300 KB had cleanedHtml returned as-is instead of being converted, putting raw HTML directly into document.markdown. Changed to truncate-then-convert so parseMarkdown() is always called. Threshold raised to 500 KB.

2. No CMP container removal (removeUnwantedElements.ts): consent overlays from Sourcepoint (#sp-cc), OneTrust, Cookiebot, Didomi, Usercentrics, Quantcast, TrustArc, and 10+ other CMPs passed through untouched. Added a cmpSelectors list that removes these unconditionally. Generic selectors are scoped to div/section to avoid matching root <html> elements that carry CMP data attributes as JS signals (not as overlay containers).

3. No markdown quality guard (single_url.ts): added isRawHtml(); if parseMarkdown() returns a string starting with <!DOCTYPE or <html, it is treated as an empty string rather than stored in document.markdown.

Also restored onlyMainContent filtering (nav/footer/sidebar/ads stripped when onlyMainContent=true, which is the default from the Zod schema). The default in the single_url.ts normalisation block was incorrectly ?? false; changed to ?? true. This reduces html field size by ~57% on heise.de.
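The quality guard from point 3 is essentially a prefix check on the converter output; a minimal sketch, assuming the real isRawHtml() does no more than this:

```typescript
// Sketch: detect when a "markdown" string is actually untranslated HTML,
// so it can be discarded instead of stored in document.markdown.
function isRawHtml(markdown: string): boolean {
  const head = markdown.trimStart().toLowerCase();
  return head.startsWith("<!doctype") || head.startsWith("<html");
}
```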
check.py was purely observational — it printed field values but never exited non-zero, so all shell tests reported passed regardless of the actual response. Rewrote check.py to accept format arguments and assert: requested formats are present and non-empty, non-requested formats are absent, markdown never contains raw HTML, and html is smaller and script/style-free compared to rawHtml. Rewrote 01-positive-tests.sh with 12 asserting tests (P-01 to P-12) covering every format in isolation and combination, including: - P-07: html must be strictly smaller than rawHtml and script/style-free - P-10: markdown-only → links key must be absent (not even empty []) - P-12: default request → only markdown, no html/rawHtml/links Added 5 new Jest unit tests to single_url.test.ts: - consent wall guard: parseMarkdown returning <!DOCTYPE → markdown = '' - includeMarkdown/includeHtml/includeRawHtml field isolation - html has script/style stripped and is smaller than rawHtml - updated 300 KB bypass test to reflect new truncate-then-convert behaviour
Add Traefik labels and external network membership (docker-proxynet, traefik-servicenet) so the API is reachable via the fc-dev subdomain with automatic TLS. Bind the host port to 127.0.0.1 only. Reduce MAX_RAM from 0.95 to 0.60 and MAX_CPU from 0.95 to 0.40 to avoid starving the host. Switch puppeteer-service and api from pre-built images to local builds so docker compose build always produces fresh images from source.
… on 404 Replace the Googlebot User-Agent with a real Chrome 124 browser UA in all four fetch() request paths (HTML, DOCX, XLSX, .doc) to prevent sites like heise.de and macwelt.de from returning bot-blocking 404 responses. Remove 404 from the terminal status list so the scraper fallback chain (tls-client -> playwright/Hero) continues on 404 instead of aborting; 415 and 500 remain terminal. A debug log message is added to make bot-blocking 404s visible in the logs.
…l-dev to firecrawl
Author
FYI: I made too many changes for a single PR, and it looks like this project is not maintained anymore.
Hi,
I know that this pull request is quite unorthodox, as it includes too many different features.
Please hear me out: I tried to get firecrawl-simple updated with fixes that have been applied to the original firecrawl repository. This has been quite a journey, as there are nearly 2,400 commits since this fork was created.
I reduced the amount by filtering out:
I got to a point where I had 900 commits, which I then split into several categories: mainly performance, HTML conversion, bug fixes, library updates, and security fixes.
This is my first set of migrated commits. Most prominent are the Redis updates, which use lazy loading like most modern implementations.
I will be fine if this PR is not merged, but please take a look at my commits.
Git log:
commit 66a869c (HEAD -> main)
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 20:29:08 2025 +0200
commit 9381721 (origin/main, origin/HEAD)
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 15:42:02 2025 +0200
commit fa316b2
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 15:33:28 2025 +0200
commit c000568
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 15:24:38 2025 +0200
commit d238fdd
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 14:51:22 2025 +0200
commit 4480a2e
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 14:46:48 2025 +0200
commit fe075a4
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 14:46:32 2025 +0200
commit c4aa67f
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 12:51:48 2025 +0200
commit 9da8569
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 12:41:34 2025 +0200
commit 853dfee
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 10:43:10 2025 +0200
commit 7796f32
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 00:53:01 2025 +0200
commit af8f411
Author: Daniel Andersen daniel@danand.de
Date: Wed Sep 24 00:30:42 2025 +0200